Summary

This document uses unnormalized TSS counts to test for differential usage of TSS’s on the same gene between conditions.

It starts with some general checks on the number of sites and number of regions used per gene.

It continues with a heuristic estimate by correlation analysis: are the levels of alt TSS’s correlated between conditions.

The highlight is the use of a negative binomial count model to assess the differential TSS usage between conditions. This model is explained in detail. We do a multiple comparisons correction at 1% FDR.

All the analysis is done on JEC21 data first. Then on H99 data, starting again from the beginning.

JEC21

Load TSS cts

Plot number of distinct counts (and genomic locations) by Gene name

Correlation analysis

There is a wide range of correlations between the top-2 most-used TSS’s

Why? Many negative correlations.

Plot counts for some genes with Highly-correlated TSS usage

## # A tibble: 50 x 8
##    Gene_name  cor2 nSites nLocations   Tot cdsAlt utr5Altx utr3Altx
##    <chr>     <dbl>  <int>      <int> <dbl> <lgl>  <lgl>    <lgl>   
##  1 CNC06940  0.952      2          2 69740 TRUE   FALSE    FALSE   
##  2 CNI00990  0.978      3          3 16051 TRUE   FALSE    FALSE   
##  3 CND00150  0.971      6          2 15508 TRUE   FALSE    FALSE   
##  4 CNC06440  0.953      3          2 13622 FALSE  TRUE     FALSE   
##  5 CND03880  0.965      2          2 11819 TRUE   FALSE    FALSE   
##  6 CNN00300  0.958      2          2  7200 TRUE   FALSE    FALSE   
##  7 CNH02910  0.983      2          2  7084 TRUE   FALSE    FALSE   
##  8 CNK01720  0.997      9          4  6182 TRUE   TRUE     FALSE   
##  9 CNB05670  0.980      6          3  5774 TRUE   FALSE    FALSE   
## 10 CND03580  0.960      4          3  5379 TRUE   FALSE    TRUE    
## # ... with 40 more rows

Plot counts for some genes with anti-correlated TSS usage

## # A tibble: 339 x 8
##    Gene_name   cor2 nSites nLocations    Tot cdsAlt utr5Altx utr3Altx
##    <chr>      <dbl>  <int>      <int>  <dbl> <lgl>  <lgl>    <lgl>   
##  1 CNA08250  -0.694      2          1 170096 FALSE  TRUE     FALSE   
##  2 CNF04420  -0.623      2          1 138837 FALSE  TRUE     FALSE   
##  3 CNN00330  -0.632      2          1  93619 FALSE  TRUE     FALSE   
##  4 CNL04650  -0.641      2          1  78263 TRUE   FALSE    FALSE   
##  5 CNN00340  -0.727      2          1  66423 FALSE  TRUE     FALSE   
##  6 CNC05950  -0.501      2          2  30570 TRUE   FALSE    FALSE   
##  7 CNI01470  -0.510      2          2  23690 TRUE   FALSE    FALSE   
##  8 CNA01370  -0.674      2          1  22363 FALSE  TRUE     FALSE   
##  9 CNA04020  -0.688      2          2  17303 FALSE  FALSE    FALSE   
## 10 CNA01070  -0.583      2          2  17268 TRUE   FALSE    FALSE   
## # ... with 329 more rows

Correlation for core 4 conditions, (30C, 37C) x (Expo, Stat)

Highly-expressed genes have more highly correlated TSS counts

Sample correlated and anti-correlated TSS

For CNC06940/RPL15, correlated. CNF04420/CAR1, anti-correlated

JEC21, Statistical model for differential alternate TSS usage between conditions

Model description

This model focuses only on the relative usage of different TSS’s changing between conditions. It does this by factoring out changes in the gene-level usage, within a statistical model incorporating count noise.

For each gene, we take a negative binomial distribution of counts \(x_{csi}\) at each TSS (site, \(s\)), in each condition (\(t\)), for each replicate \(i\): \[x_{sti} \sim Negbin( \exp( \alpha_{ti} + \beta_{ts} ), \theta )\]

Here the coefficients \(\alpha\) track different total counts between replicates for each gene. The \(\beta\) coefficients track different usage between sites at each condition.

The dispersion parameter \(\theta\) estimates how much more dispersed the counts are relative to a poisson model. A \(NegBin(\mu, \theta)\) variable has mean \(\mu\) and variance \(\mu + \mu^2/\theta\). We use the implementation of negative binomial with log link function from the MASS package’s negative.binomial function, and the glm function in base R to fit.

To estimate the value of \(\theta\) from the whole dataset, we estimate overdispersion of counts from 10 random subsets of single-site genes across conditions.

Later, to control the false discovery rate after testing thousands of genes, we will use the Benjamini-Hochberg criterion applied to the p-values of coefficients.

Example data

The data look like this, for a sample gene CNA00250/RPL9, so 12 data points = 2 sites x 2 conditions x 3 replicates.

## # A tibble: 12 x 5
##    Gene_name Site  Condition Rep   Count
##    <chr>     <fct> <fct>     <fct> <dbl>
##  1 CNA00250  1     Expo30    a      4460
##  2 CNA00250  2     Expo30    a       503
##  3 CNA00250  1     Expo30    b      2747
##  4 CNA00250  2     Expo30    b       207
##  5 CNA00250  1     Expo30    c      2271
##  6 CNA00250  2     Expo30    c       143
##  7 CNA00250  1     Expo37    a      1011
##  8 CNA00250  2     Expo37    a       328
##  9 CNA00250  1     Expo37    b       765
## 10 CNA00250  2     Expo37    b       221
## 11 CNA00250  1     Expo37    c       883
## 12 CNA00250  2     Expo37    c       272

Example fit

The fit to this example data has coefficients (conditions 30, 37 degrees centigrade; replicates a,b,c; sites 1,2). We give the names of the coefficients both in accordance with the model above, and also as automatically assigned by R’s glm function.

## # A tibble: 8 x 5
##   coeff               term                  estimate std.error     p.value
##   <chr>               <chr>                    <dbl>     <dbl>       <dbl>
## 1 a_{30,a}            (Intercept)              8.56      0.118 0.000000216
## 2 b_{30,2}            Site2                   -2.49      0.124 0.0000364  
## 3 a_{37,a} - a_{30,a} ConditionExpo37         -1.61      0.169 0.000672   
## 4 a_{30,b} - a_{30,a} ConditionExpo30:Repb    -0.677     0.149 0.0104     
## 5 a_{37,b} - a_{30,a} ConditionExpo37:Repb    -0.331     0.153 0.0957     
## 6 a_{30,c} - a_{30,a} ConditionExpo30:Repc    -0.932     0.151 0.00346    
## 7 a_{37,c} - a_{30,a} ConditionExpo37:Repc    -0.160     0.151 0.351      
## 8 b_{37,2} - b_{30,2} ConditionExpo37:Site2    1.31      0.176 0.00174

The key question for differential alt usage is if \[ \beta_{37,2} - \beta_{30,2} \] i.e. ConditionExpo37:Site2, is significant and substantial. For this gene, the difference is big, as we can see from the graph, and significant as we see from the p value of 0.0017.

The fit has 8 coefficients for 12 data points, so 4 degrees of freedom:

## # A tibble: 1 x 7
##   null.deviance df.null logLik   AIC   BIC deviance df.residual
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int>
## 1         1072.      11  -66.5  149.  153.     6.80           4

Estimate theta (negative binomial dispersion) for 10 subsets of 50 genes

Fit negative binomial model to counts for each alt-TSS gene separately

Alternate TSS usage at Expo 30

p-values from this are nowhere near uniformly distributed

This suggests that most alt sites are either very differentially used or very similarly used; there’s not much middle ground.

Volcano plot shows great variety in alt TSS usage

Alternate TSS usage at Expo 37C vs 30C

p-values from this are nowhere near uniformly distributed either

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 857 x 5
##    Gene_name term                     Tot log10abs     p.value
##    <chr>     <chr>                  <dbl>    <dbl>       <dbl>
##  1 CNA08250  ConditionExpo37:Site2 170096    0.771 0.0000675  
##  2 CNN00330  ConditionExpo37:Site2  93619    3.20  0.0000426  
##  3 CNK02240  ConditionExpo37:Site2  72448    0.423 0.000217   
##  4 CNA04340  ConditionExpo37:Site2  54224    1.50  0.00000913 
##  5 CNA04920  ConditionExpo37:Site2  45005    0.666 0.000714   
##  6 CNK02330  ConditionExpo37:Site2  44647    0.825 0.00000872 
##  7 CNB02670  ConditionExpo37:Site2  42510    1.86  0.000000886
##  8 CNB02670  ConditionExpo37:Site3  42510    0.608 0.000459   
##  9 CNC06180  ConditionExpo37:Site2  31787    1.47  0.000632   
## 10 CNA06460  ConditionExpo37:Site2  31648    0.699 0.000777   
## # ... with 847 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate TSS usage at Stat vs Expo at 30C

p-values from this are nowhere near uniformly distributed either

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 1,144 x 5
##    Gene_name term                     Tot log10abs     p.value
##    <chr>     <chr>                  <dbl>    <dbl>       <dbl>
##  1 CNA08250  ConditionStat30:Site2 170096    1.54  0.0000379  
##  2 CNF04420  ConditionStat30:Site2 138837    2.23  0.0000480  
##  3 CNJ00950  ConditionStat30:Site2  78481    0.799 0.000000218
##  4 CNJ00950  ConditionStat30:Site3  78481    0.708 0.000000565
##  5 CNJ00950  ConditionStat30:Site4  78481    0.432 0.0000714  
##  6 CNM02200  ConditionStat30:Site2  69744    1.55  0.0000342  
##  7 CNA04340  ConditionStat30:Site2  54224    0.667 0.000548   
##  8 CNK02330  ConditionStat30:Site2  44647    0.778 0.000140   
##  9 CNM02380  ConditionStat30:Site2  40807    0.912 0.0000309  
## 10 CNC04820  ConditionStat30:Site2  33913    0.969 0.000799   
## # ... with 1,134 more rows

Stat model indeed identifies TSS’s that are differentially used

Stat37 vs Stat30

Volcano plot with FDR control

Top hits of stat model

## # A tibble: 1,882 x 5
##    Gene_name term                    Tot log10abs   p.value
##    <chr>     <chr>                 <dbl>    <dbl>     <dbl>
##  1 CNJ00410  ConditionStat37:Site2 51084    0.977 0.000343 
##  2 CNL05140  ConditionStat37:Site2 45132    0.307 0.00167  
##  3 CNK02330  ConditionStat37:Site2 44647    0.905 0.000401 
##  4 CNB02670  ConditionStat37:Site2 42510    0.534 0.00118  
##  5 CNA00170  ConditionStat37:Site2 39848    0.734 0.000955 
##  6 CNC06180  ConditionStat37:Site2 31787    1.52  0.000310 
##  7 CNA06460  ConditionStat37:Site2 31648    0.477 0.0000364
##  8 CNL04750  ConditionStat37:Site2 24597    0.745 0.000433 
##  9 CNC02870  ConditionStat37:Site2 24045    0.474 0.0000336
## 10 CNM01710  ConditionStat37:Site2 22342    0.831 0.0000411
## # ... with 1,872 more rows

Stat model indeed identifies TSS’s that are differentially used

Stat37 vs Expo37

Volcano plot with FDR control

Top hits of stat model

## # A tibble: 1,948 x 5
##    Gene_name term                     Tot log10abs     p.value
##    <chr>     <chr>                  <dbl>    <dbl>       <dbl>
##  1 CNA08250  ConditionStat37:Site2 170096    2.28  0.000000832
##  2 CNF04420  ConditionStat37:Site2 138837    2.34  0.0000306  
##  3 CNN00330  ConditionStat37:Site2  93619    3.36  0.000169   
##  4 CNK02310  ConditionStat37:Site2  84186    1.03  0.000232   
##  5 CNM02200  ConditionStat37:Site2  69744    0.993 0.000106   
##  6 CNA04340  ConditionStat37:Site2  54224    1.51  0.000287   
##  7 CND04780  ConditionStat37:Site2  48791    1.05  0.000210   
##  8 CNF02040  ConditionStat37:Site2  45927    0.602 0.00156    
##  9 CND01440  ConditionStat37:Site2  45905    0.617 0.00114    
## 10 CNK02330  ConditionStat37:Site2  44647    0.699 0.000751   
## # ... with 1,938 more rows

Stat model indeed identifies TSS’s that are differentially used

Calculate alternate TSS usage at upf1 vs WT

Alternate TSS usage at upf1 vs WT, Expo 30C

p-values from this are nowhere near uniformly distributed either

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 737 x 5
##    Gene_name term                   Tot log10abs     p.value
##    <chr>     <chr>                <dbl>    <dbl>       <dbl>
##  1 CNA08250  Conditionupf1:Site2 170096    0.988 0.0000856  
##  2 CNB02670  Conditionupf1:Site3  42510    0.331 0.000634   
##  3 CNM02380  Conditionupf1:Site2  40807    0.985 0.000207   
##  4 CNA05220  Conditionupf1:Site2  29490    1.48  0.000142   
##  5 CNK03330  Conditionupf1:Site2  27867    0.820 0.000660   
##  6 CNL04750  Conditionupf1:Site3  24597    1.29  0.0000734  
##  7 CNG03220  Conditionupf1:Site3  24505    0.559 0.000204   
##  8 CNG01250  Conditionupf1:Site3  23108    1.43  0.000000106
##  9 CNB02690  Conditionupf1:Site2  22186    0.757 0.000100   
## 10 CNN00420  Conditionupf1:Site2  22026   10.9   0.0000756  
## # ... with 727 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate TSS usage at upf1 vs WT, Expo 37C

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 729 x 5
##    Gene_name term                  Tot log10abs     p.value
##    <chr>     <chr>               <dbl>    <dbl>       <dbl>
##  1 CNN00330  Conditionupf1:Site2 93619    2.46  0.000108   
##  2 CNA04340  Conditionupf1:Site2 54224    0.988 0.000251   
##  3 CNB02670  Conditionupf1:Site2 42510    0.719 0.000337   
##  4 CNB02670  Conditionupf1:Site3 42510    0.867 0.000162   
##  5 CNA00250  Conditionupf1:Site2 32405    0.634 0.000177   
##  6 CNA06460  Conditionupf1:Site2 31648    0.698 0.000251   
##  7 CNF00330  Conditionupf1:Site2 25959    1.28  0.000651   
##  8 CNG03220  Conditionupf1:Site3 24505    0.474 0.000106   
##  9 CNC02870  Conditionupf1:Site2 24045    0.817 0.0000215  
## 10 CNJ03160  Conditionupf1:Site2 23055   21.7   0.000000970
## # ... with 719 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate TSS usage at upf1 vs WT, Stat 30C

p-values from this are nowhere near uniformly distributed either

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 1,164 x 5
##    Gene_name term                   Tot log10abs  p.value
##    <chr>     <chr>                <dbl>    <dbl>    <dbl>
##  1 CNG00830  Conditionupf1:Site2 175616    0.717 4.60e- 4
##  2 CNA08250  Conditionupf1:Site2 170096    1.68  6.33e- 5
##  3 CNF04420  Conditionupf1:Site2 138837    2.34  3.53e- 5
##  4 CNJ00950  Conditionupf1:Site2  78481    0.449 3.20e-10
##  5 CNJ00950  Conditionupf1:Site3  78481    0.508 6.33e-11
##  6 CNM02200  Conditionupf1:Site2  69744    1.70  3.94e- 6
##  7 CNA04360  Conditionupf1:Site2  51497    0.380 6.30e- 4
##  8 CNL05140  Conditionupf1:Site2  45132    0.462 2.54e- 4
##  9 CNM02380  Conditionupf1:Site2  40807    0.880 9.74e- 4
## 10 CNC06180  Conditionupf1:Site2  31787    1.03  4.56e- 4
## # ... with 1,154 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate conditions, upf1 strain

Alternate TSS usage at Expo 37C vs 30C, upf1

p-values from this are nowhere near uniformly distributed either

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 822 x 5
##    Gene_name term                    Tot log10abs    p.value
##    <chr>     <chr>                 <dbl>    <dbl>      <dbl>
##  1 CND04780  ConditionExpo37:Site2 48791    2.12  0.000234  
##  2 CNB02670  ConditionExpo37:Site2 42510    1.87  0.00000502
##  3 CNB02670  ConditionExpo37:Site3 42510    0.590 0.0000508 
##  4 CNA00170  ConditionExpo37:Site2 39848    0.761 0.000319  
##  5 CNK03330  ConditionExpo37:Site2 27867    1.11  0.0000384 
##  6 CNG03220  ConditionExpo37:Site3 24505    0.601 0.0000732 
##  7 CNC02870  ConditionExpo37:Site2 24045    0.476 0.000449  
##  8 CNN00420  ConditionExpo37:Site2 22026   10.9   0.0000756 
##  9 CNK00060  ConditionExpo37:Site2 21117    1.99  0.00000848
## 10 CNC05790  ConditionExpo37:Site2 20518    1.55  0.0000199 
## # ... with 812 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate TSS usage at Stat vs Expo at 30C, upf1

p-values from this are nowhere near uniformly distributed either

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 1,118 x 5
##    Gene_name term                     Tot log10abs    p.value
##    <chr>     <chr>                  <dbl>    <dbl>      <dbl>
##  1 CNA08250  ConditionStat30:Site2 170096    0.841 0.000375  
##  2 CNK02310  ConditionStat30:Site2  84186    0.983 0.000127  
##  3 CNJ00950  ConditionStat30:Site4  78481    0.519 0.0000288 
##  4 CNL04650  ConditionStat30:Site2  78263    1.77  0.00000728
##  5 CND01200  ConditionStat30:Site2  64717    0.741 0.000341  
##  6 CNB01800  ConditionStat30:Site2  64334    0.685 0.000536  
##  7 CNA04360  ConditionStat30:Site2  51497    0.767 0.000352  
##  8 CNF02040  ConditionStat30:Site2  45927    0.603 0.0000779 
##  9 CNB02670  ConditionStat30:Site2  42510    1.15  0.000103  
## 10 CNG00970  ConditionStat30:Site2  41577    0.912 0.0000257 
## # ... with 1,108 more rows

Stat model indeed identifies TSS’s that are differentially used

H99

Load TSS cts

Plot number of distinct counts (and genomic locations) by Gene name

Correlation analysis

There is a wide range of correlations between the top-2 most-used TSS’s

There are many negative correlations.

Plot counts for some genes with Highly-correlated TSS usage

## # A tibble: 102 x 8
##    Gene_name   cor2 nSites nLocations    Tot cdsAlt utr5Altx utr3Altx
##    <chr>      <dbl>  <int>      <int>  <dbl> <lgl>  <lgl>    <lgl>   
##  1 CNAG_06125 0.960      3          2 277167 TRUE   FALSE    FALSE   
##  2 CNAG_06113 0.954      2          1  33203 FALSE  TRUE     FALSE   
##  3 CNAG_06835 0.976      2          2  30668 TRUE   FALSE    FALSE   
##  4 CNAG_00992 0.969      2          2  30303 TRUE   FALSE    FALSE   
##  5 CNAG_00370 0.967      2          2  25042 TRUE   FALSE    FALSE   
##  6 CNAG_01230 0.970      3          2  20703 TRUE   FALSE    FALSE   
##  7 CNAG_06208 0.968      2          2  18103 TRUE   FALSE    FALSE   
##  8 CNAG_02100 0.965      2          2  17802 TRUE   FALSE    FALSE   
##  9 CNAG_05800 0.960      2          2  15041 TRUE   FALSE    FALSE   
## 10 CNAG_03577 0.960      3          3  14689 TRUE   FALSE    FALSE   
## # ... with 92 more rows

Plot counts for some genes with anti-correlated TSS usage

## # A tibble: 813 x 8
##    Gene_name    cor2 nSites nLocations   Tot cdsAlt utr5Altx utr3Altx
##    <chr>       <dbl>  <int>      <int> <dbl> <lgl>  <lgl>    <lgl>   
##  1 CNAG_04388 -0.643      2          2 23624 TRUE   FALSE    FALSE   
##  2 CNAG_03688 -0.549      2          2 23549 TRUE   FALSE    FALSE   
##  3 CNAG_04601 -0.542      2          1 21648 FALSE  TRUE     FALSE   
##  4 CNAG_01019 -0.530      2          2 20799 TRUE   FALSE    FALSE   
##  5 CNAG_05980 -0.859      2          1 16853 FALSE  TRUE     FALSE   
##  6 CNAG_04904 -0.580      2          2 14356 TRUE   FALSE    FALSE   
##  7 CNAG_05465 -0.524      2          2 14279 TRUE   FALSE    FALSE   
##  8 CNAG_06120 -0.687      3          3 13480 TRUE   FALSE    FALSE   
##  9 CNAG_05408 -0.710      2          2 12134 TRUE   FALSE    FALSE   
## 10 CNAG_06577 -0.515      2          2 12054 TRUE   FALSE    FALSE   
## # ... with 803 more rows

CNAG_01019, Superoxide dismutase with role in ROS metabolism, interesting!

H99, Stat model to quantify differential TSS usage

Scratch code comparing Poisson and negative binomial models.

Some code that I used to test and set up the models is here

The code in the next chunk showed that fitting a single model across many hundreds/thousands of genes would be prohibitively slow to fit. In principle it would be possible to recode as a hierarchical model with information pooled between genes (e.g. dispersion \(\theta\)). However, it was more efficient to estimate \(\theta\) from some subsamples, and then fit each gene separately, the approach we took in the end.

Correlation for core 4 conditions, (30C, 37C) x (Expo, Stat)

Highly-expressed genes have more highly correlated TSS counts

H99, Statistical model for alternative TSS usage between conditions

This is the model we actually used, as explained above under JEC21.

Estimate theta (negative binomial dispersion) for 10 subsets of 50 genes

Fit negative binomial model to counts for each alt-TSS gene separately

Negative binomial is an extension of Poisson model with extra parameter theta; NegBin(mu, theta) has mean mu and variance mu + mu^2/theta.

Alternate TSS usage at Expo 30

p-values from this are nowhere near uniformly distributed

This suggests that most alt sites are either very differentially used or very similarly used; there’s not much middle ground.

Volcano plot shows great variety in alt TSS usage

Alternate TSS usage at Expo 37C vs 30C

p-values from this are nowhere near uniformly distributed either

Volcano plot with FDR control

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 798 x 5
##    Gene_name  term                    Tot log10abs     p.value
##    <chr>      <chr>                 <dbl>    <dbl>       <dbl>
##  1 CNAG_01990 ConditionExpo37:Site2 39447    0.426 0.0000733  
##  2 CNAG_05774 ConditionExpo37:Site2 27812    0.401 0.000233   
##  3 CNAG_04585 ConditionExpo37:Site2 23225    0.919 0.0000483  
##  4 CNAG_01230 ConditionExpo37:Site3 20703    0.881 0.000144   
##  5 CNAG_00995 ConditionExpo37:Site2 20613    1.94  0.000440   
##  6 CNAG_03492 ConditionExpo37:Site2 19717    2.12  0.000651   
##  7 CNAG_02527 ConditionExpo37:Site2 15437    0.955 0.000368   
##  8 CNAG_02527 ConditionExpo37:Site3 15437    0.849 0.000631   
##  9 CNAG_01043 ConditionExpo37:Site2 14653    0.516 0.000163   
## 10 CNAG_06120 ConditionExpo37:Site2 13480    1.45  0.000000239
## # ... with 788 more rows

Stat model indeed identifies TSS’s that are differentially used

Alternate TSS usage at Stat vs Expo at 30C

p-values from this are nowhere near uniformly distributed either

Coloured for significance at 1% FDR, Benjamini-Hochberg.

Top hits of stat model

## # A tibble: 723 x 5
##    Gene_name  term                    Tot log10abs     p.value
##    <chr>      <chr>                 <dbl>    <dbl>       <dbl>
##  1 CNAG_04388 ConditionStat30:Site2 23624    0.864 0.000599   
##  2 CNAG_01019 ConditionStat30:Site2 20799    1.98  0.000000621
##  3 CNAG_01230 ConditionStat30:Site3 20703    0.921 0.000126   
##  4 CNAG_02100 ConditionStat30:Site2 17802    0.215 0.0000951  
##  5 CNAG_00457 ConditionStat30:Site2 17646    0.796 0.0000437  
##  6 CNAG_05980 ConditionStat30:Site2 16853   13.6   0.000000389
##  7 CNAG_02880 ConditionStat30:Site2 16214    0.478 0.000449   
##  8 CNAG_06120 ConditionStat30:Site2 13480    1.80  0.0000520  
##  9 CNAG_06120 ConditionStat30:Site3 13480    1.63  0.0000112  
## 10 CNAG_02500 ConditionStat30:Site2 13357    0.573 0.000544   
## # ... with 713 more rows

Stat model indeed identifies TSS’s that are differentially used

- CNAG_04245, chitinase.
- CNAG_06817, UAP cation symporter

Stat37 vs Stat30

Volcano plot with FDR control

Top hits of stat model

## # A tibble: 1,714 x 5
##    Gene_name  term                     Tot log10abs    p.value
##    <chr>      <chr>                  <dbl>    <dbl>      <dbl>
##  1 CNAG_06125 ConditionStat37:Site2 277167    0.269 0.00157   
##  2 CNAG_06125 ConditionStat37:Site3 277167    0.280 0.00115   
##  3 CNAG_01884 ConditionStat37:Site2  51358    0.810 0.000611  
##  4 CNAG_06238 ConditionStat37:Site2  50468    0.452 0.000358  
##  5 CNAG_01486 ConditionStat37:Site2  45953    0.388 0.000163  
##  6 CNAG_01486 ConditionStat37:Site3  45953    0.662 0.00000708
##  7 CNAG_01990 ConditionStat37:Site2  39447    0.702 0.000344  
##  8 CNAG_04659 ConditionStat37:Site2  35362    0.334 0.00188   
##  9 CNAG_06113 ConditionStat37:Site2  33203    0.317 0.000179  
## 10 CNAG_05521 ConditionStat37:Site2  29217    0.366 0.000249  
## # ... with 1,704 more rows

Stat model indeed identifies TSS’s that are differentially used

  • CNAG_06125, Heat Shock Protein HSP90
  • CNAG_06238, Glutathione S-Transferase

Stat37 vs Expo37

Volcano plot with FDR control

Top hits of stat model

## # A tibble: 1,659 x 5
##    Gene_name  term                     Tot log10abs    p.value
##    <chr>      <chr>                  <dbl>    <dbl>      <dbl>
##  1 CNAG_04004 ConditionStat37:Site2 106287    0.539 0.000592  
##  2 CNAG_04794 ConditionStat37:Site2  94099    0.959 0.00000909
##  3 CNAG_01884 ConditionStat37:Site2  51358    0.792 0.0000611 
##  4 CNAG_06238 ConditionStat37:Site2  50468    0.459 0.00158   
##  5 CNAG_06541 ConditionStat37:Site2  49505    0.450 0.000173  
##  6 CNAG_01486 ConditionStat37:Site3  45953    0.534 0.0000201 
##  7 CNAG_01990 ConditionStat37:Site2  39447    0.526 0.0000660 
##  8 CNAG_00091 ConditionStat37:Site2  37223    0.501 0.00115   
##  9 CNAG_06113 ConditionStat37:Site2  33203    0.531 0.000725  
## 10 CNAG_04021 ConditionStat37:Site2  33106    0.576 0.00163   
## # ... with 1,649 more rows

Stat model indeed identifies TSS’s that are differentially used